SHORT WRITEUP FOR SSORT.DMP[1,3], THE VARIABLE-LENGTH STRING SORT/MERGE
SSORT means String-SORT/merge. It takes an input file,
breaks it into records according to some particular delimiting
character, sorts the records into ASCII-sequential order (note,
the /L and /U switches modify this sequence), and assembles an
output file with the delimiters restored or replaced by a new
delimiter. It simulates paging, keeping data on disk (or in an
in-core cache) except while it is actually being sorted, and is
fairly efficient as long as the cache does not thrash.
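In outline, the effect (though not the mechanism) is the following,
sketched here in Python purely for illustration; paging, switches,
and duplicate-purging are all omitted:

        # Rough model of SSORT's effect: split on a delimiter,
        # sort the frames, rejoin with the output delimiter.
        def ssort(text, delim="\n", out_delim=None):
            frames = text.split(delim)
            frames.sort()          # ASCII-sequential order
            out = delim if out_delim is None else out_delim
            return out.join(frames)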
THE COMMAND TO INVOKE THE PROGRAM IS R SSORT.
THE INPUT FILE SHOULDN'T HAVE SOS LINE NUMBERS OR A TV/E DIRECTORY, BUT IT MIGHT
WORK ANYWAY. THE OUTPUT FILE WON'T OVERWRITE AN EXISTING FILE UNLESS YOU SAY YES.
A TEMPORARY FILE, CALLED SOMETHING LIKE <JOBNUM>SORT.TMP, WILL BE CREATED AND DELETED.
The file is assumed to be broken up into frames by a special delimiter
character, which may be any ASCII character you choose, except null,
which is usually ignored. There is a special feature called the "JMC"
kludge which allows two consecutive delimiters, rather than one, to act
as the frame separator, thus allowing a blank line between paragraphs
to be the frame separator if linefeed is the delimiter and
carriage-return is deleted upon input and replaced upon output.
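For instance (a Python sketch of the idea, not of the program): with
linefeed as delimiter, carriage-return deleted on input, and the JMC
kludge turned on, a blank line ends up separating the frames:

        raw = "first para\r\nstill first\r\n\r\nsecond para\r\n"
        text = raw.replace("\r", "")   # delete <cr> on input
        paras = text.split("\n\n")     # JMC: double <lf> separates frames
        paras.sort()
        out = "\n\n".join(paras).replace("\n", "\r\n")  # restore <cr>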
When you start up the program, it will type * and if you just type <crlf>
it will remind you of the command format and give you a list of switches.
The command format is <OUTPUT-FILE>←<INPUT-FILE><SWITCHES><CRLF>.
Switches are parsed from left to right; the default switches are
effectively placed ahead of whatever switches you type.
Each of the /A /O /B switches overrides any previous delimiter
specification and also resets any /D or /S to the usual values
(i.e. /D<nil> /S<delim>). The /F and /R switches merely turn on flags
in the program, the /D switch appends to the list of /D characters,
and the /S switch replaces the list of /S characters.
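The parse might be modeled like this (a hypothetical sketch; the
switch letters are SSORT's, everything else here is invented, and /A
is omitted to keep the scanner trivial):

        import re

        DEFAULTS = "/O12/D15/D14/S15&12"   # the documented defaults

        def parse(switches=""):
            delim, deleted, subst = "\n", [], None
            front = retain = False
            # Defaults are parsed first; user switches simply follow.
            for sw, arg in re.findall(r"/(.)([0-7&]*)", DEFAULTS + switches):
                if sw == "O":                 # delimiter by octal value;
                    delim = chr(int(arg, 8))  #   resets /D and /S
                    deleted, subst = [], None
                elif sw == "B":               # <lf> delim, <cr> deleted,
                    delim = "\n"              #   JMC flag (not shown here)
                    deleted, subst = ["\r"], None
                elif sw == "D":               # /D appends to the list
                    deleted.append(chr(int(arg, 8)))
                elif sw == "S":               # /S replaces the string
                    subst = "".join(chr(int(n, 8)) for n in arg.split("&"))
                elif sw == "F":
                    front = True
                elif sw == "R":
                    retain = True
            if subst is None:
                subst = delim                 # usual value: /S<delim>
            return delim, deleted, subst, front, retain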
/A<KH> and /O<OCTAL NUMBER> are used to specify the delimiter character, depending on
whether you want to use the character itself or the octal value for it
(note, for activation characters, you must use the /O form).
/B causes linefeed to be the delimiter, <cr> to be ignored but patched
back in when you are done, and the special double-occurrence flag (the
JMC feature) to be turned on. If you want the double-occurrence flag
but some other delimiter, it might be possible to do /B followed by /A
or /O, but this hasn't been tested as of 1975.APR.22 because no one
has had any use for it.
/F causes the program to assume the delimiters are at the front rather than
the usual rear-end of each frame. The major effect is at the start and
end of the file where an extra blank frame or a lost frame might occur.
This should not be done if linefeed is the delimiter because the last
record (line) will be unreadable by many text-processing programs.
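The start-of-file case is easy to see in a sketch: when each frame
carries its delimiter in front, a naive split produces a spurious
empty frame ahead of the first real one:

        >>> "∨22 APR fixed a bug∨23 APR found a new bug".split("∨")
        ['', '22 APR fixed a bug', '23 APR found a new bug']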
/R causes retention of any duplicate frames. Normally, extra copies
are purged. There is at present no way to count the duplicate frames
(as you might want when counting word frequencies in a dictionary).
There is no way to output only the duplicated frames, but a SRCCOM
between the output with /R in effect and the output without /R will
approximate this function.
/D<OCTAL NUMBER> causes that character to disappear during input. You
may still resupply it upon output by /S. Carriage-return is normally
handled this way, by SSORT (using default switches) and other SU-AI programs.
/S<OCTAL NUMBER LIST> causes that string (up to 5 characters) to be used
instead of your normal delimiter upon output.
/L (temporary kludge largely replaced by /U) causes the sorting
sequence to be modified so that A < a < B < b < C ... z < [
i.e. the lower-case characters are moved to be just greater
than their upperifications.
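The /L sequence can be modeled (a sketch, assuming plain ASCII text)
as a per-character sort key of (upper-cased character, case bit):

        def l_key(s):
            # A < a < B < b < ... < Z < z: primary order by the
            # upper-cased letter, lower-case twin sorting just after.
            return [(c.upper(), c.islower()) for c in s]

        sorted(["bat", "Ant", "ant", "Bat"], key=l_key)
        # -> ['Ant', 'ant', 'Bat', 'bat']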
/U causes lower-case characters to be treated as their upper-case
equivalents during the actual sorting, without modifying the text
in the actual records. Thus, in the absence of the /R switch,
"Aardvark" and "aardVARK" would be considered identical and exactly
one (not specified which one!) would be selected for output; with
/R they would both appear, but not in any particular order. /U is
appropriate for sorting most indexes and bibliographies, providing
that no formatting characters are in some records and not in others
to screw up the sorting sequence.
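Modeled in the same illustrative style: /U sorts on an upper-cased
copy of each record while writing out the original text, and without
/R the tied spellings collapse to a single record:

        records = ["aardVARK", "zebra", "Aardvark"]
        seen, out = set(), []
        for rec in sorted(records, key=str.upper):
            if rec.upper() not in seen:   # duplicate under /U: purged
                seen.add(rec.upper())
                out.append(rec)
        # out == ['aardVARK', 'zebra']; which tied spelling survives
        # is an accident of the method, just as in SSORT.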
The default switches for SSORT are /O12/D15/D14/S15&12 or
something close enough that you will understand it from
reading this. /O12 means linefeed is delimiter. /D15 and
/D14 mean that all carriage-returns and formfeeds are deleted
during the first input phase. Thus each record stored inside
SSORT will consist of pure text without any <cr> or <ff>
(purged) or <lf> (each linefeed is replaced with
end-of-record mark inside SSORT). /S15&12 means that upon
final output the delimiter (which was linefeed originally,
with a carriage-return in front of it most likely) is
replaced by 15&12 which is carriage-return followed by
linefeed. Thus normal SU-AI files will be put back the way
they were (except that they will end up alphabetized, and all
page-marks will be removed) and other files will be
"fixed-up" into SU-AI format.
TYPICAL USES:
A file of single-line items, such as FORTUN.TXT[2,2]
-- Use default switches and the file will be sorted
line-for-line. Indeed, SSORT was actually used on FORTUN.TXT
which is why it has been in alphabetical order since 1972.
A bibliography consisting of paragraphs of text, with
the significant key at the front of each paragraph, and a
blank line between paragraphs -- Use the /B switch. Each
paragraph will be considered an indivisible unit, and the
paragraphs will be put in correct sequence.
A bibliography like above, but where case of letters
is to be ignored -- Use /B/U.
REM's diary, consisting of items preceded by a
year-month-date field that consists of the character ∨
followed by the other info. ∨ does not occur elsewhere in the
file, so it can be used as a delimiter without any other
scanning. -- Use the switches /A∨/F, which establish ∨ as the
delimiter and assume it is at the beginning of each record.
A bibliography like described above but with the key
not the first thing in each record. -- SSORT is not equipped
to handle this. You must reformat the file, using SOS or
TECO or a specially-written program, usually written in SAIL,
so that the key is at the start of each record. Then run
SSORT. Then reformat the file back into the desired format.
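The reformat-sort-reformat dance might look like this in outline (a
sketch only; the sample records and the key-extraction rule are made
up, and the real reformatting would be done in SOS, TECO, or SAIL):

        # Records with the key ("Author year") at the END of each.
        records = ["A study of aardvarks | Jones 1970",
                   "Zebra migration | Abel 1965"]

        def key_of(rec):
            return rec.split("|")[-1].strip()  # whatever locates your key

        # Move the key to the front, sort (or hand off to SSORT), strip.
        decorated = sorted(key_of(r) + "\t" + r for r in records)
        records = [d.split("\t", 1)[1] for d in decorated]
        # -> the Abel 1965 item now precedes the Jones 1970 item.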
HOW THIS PROGRAM COMPARES WITH IBM SORT/MERGE ET AL.
SSORT is the only existing sort/merge program that can
handle arbitrarily-long strings in random format. IBM only
handles strings whose length is known a priori. IBM does
allow the key to be in any fixed location, but again I
believe the number of characters from the start of the record
to the start of the key must be a constant, which is unlike
anything we do here.
Martin Frost has a fast sorter for Balkan Folk Dance
files (lists of records in the Stanford Folk-Dancers record
collection); however, the files must be specially formatted for
the sorter to work, and I believe the maximum size is
restricted and the locations of the fields are almost fixed.
MIT TECO has a command for performing sort on the file
under edit. It allows variable-length strings and the user
can define operations to locate the key anywhere in the
string by an arbitrary set of three search commands, one to
move from the start of record to start of key, one to move
from start to end of key, and one to move from end of key to
end of record (beginning of next record). It is not protected
against the user defining contradictory (unreasonable)
searches, which it will dutifully perform until you run out of
core or call back the monitor after waiting forever while it
sits in a run loop. For more info, consult MRC, who informs
me it is efficient. Unfortunately, MIT-TECO isn't yet here
at SU-AI (as far as I know).
The method SSORT uses is a HEAPSORT on the first word
(5 characters) of each record, followed by HEAPSORTs on the
n+1st words for each batch of records that are identical up
through the nth words. If it reaches END-OF-RECORD in
matching records, then all but one copy of each identical
record are purged, except when the /R switch is enabled. The
actual data is kept on the disk, with a word of exactly 0 to
signify END-OF-RECORD. Since nulls are always purged during
the initial parse, this works.
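A Python model of the algorithm (not the actual machine code; a
"word" here is a 5-character slice, as on the PDP-10, and heapq
stands in for the hand-coded HEAPSORT):

        import heapq

        WORD = 5   # five ASCII characters per PDP-10 word

        def word(rec, n):
            # The n-th 5-character word of rec; "" past end-of-record.
            return rec[n * WORD:(n + 1) * WORD]

        def ssort_pass(records, n=0):
            # Heapsort on the n-th word, then recurse on each batch of
            # records that are still tied through word n.
            heap = [(word(r, n), i, r) for i, r in enumerate(records)]
            heapq.heapify(heap)
            ordered = [heapq.heappop(heap)[2] for _ in range(len(heap))]
            out, i = [], 0
            while i < len(ordered):
                j = i
                while j < len(ordered) and word(ordered[j], n) == word(ordered[i], n):
                    j += 1
                batch = ordered[i:j]
                if word(batch[0], n) == "" or len(batch) == 1:
                    # unique, or identical through end-of-record
                    # (duplicates purged, as without /R)
                    out.append(batch[0])
                else:
                    out.extend(ssort_pass(batch, n + 1))
                i = j
            return out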
To reduce the number of disk accesses, a virtual
memory system is simulated in core. Thus if the file is small
then almost the entire sort runs in core, although with much
software overhead. -- If the file is large but partly-sorted,
then paging is very efficient, consisting of a simulation of a
pure merge operation with buffered-mode input on each segment
that is in ascending sequence. (This cannot be done easily on
this system any other way since the number of I/O channels is
limited to '20 whereas SSORT can simulate as many
single-buffered input channels as there are pages simulated in
core, providing that the hashing of disk record numbers is
uniform in bucket space).
If the file is large and not sorted at all but
consists of a reasonably-small number of large records then
the paging is very efficient, consisting of simulating one
buffered input for each record during the decide-the-sequence
phase, and consisting of random-access input with one USETI
per record during the final output phase.
If most of the records can be distinguished by their
first five characters, then the first pass of sorting in core
does most of the job, so that almost no paging must be done.
If the file consists of short records, however, the final pass
to write out the records will be many USETIs, one per record,
and hence will be slow although linear in time with respect to
the number of records.
If the file consists of a very large number of similar
and short records, given in totally random sequence (entropic,
not sorted) then thrashing will occur and it will take a very
very long time to do the job. In that case I suggest you
break the file up into pieces that can be sorted in core, and
sort each piece by itself, then concatenate the pieces into a
large file and do one last sort on it (the last pass will be
the case of simulating a pure-merge, so it will be efficient).
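That last pass behaves as a pure merge of ascending runs, which can
be modeled like so (a sketch):

        import heapq

        # Each piece, sorted in core, is one ascending run; the final
        # pass over the concatenation is a pure merge of the runs.
        runs = [["ant", "cow"], ["bee", "dog"], ["ape", "cat"]]
        merged = list(heapq.merge(*runs))
        # -> ['ant', 'ape', 'bee', 'cat', 'cow', 'dog']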
If you find an error in this writeup, please add an addenda at
the end of this file:
ADDENDUM by DON: In the above paragraph, "addenda" should be "addendum".
ADDENDUM by RAK: DON is wrong; suppose you find two things wrong?
However "an" should be deleted.